Apply learning rate scaling to min_lr #64
I wanted to clarify how the linear learning rate scaling should be handled. This is sort of an edge case, but I think that in order to obtain the correct learning rate when continuing training from a checkpoint on a different GPU count or with a different batch size, it would be required to also scale `min_lr`?

Here's a visualization of the difference between the current & the proposed implementation (for a few example parameters, `epochs = 42, niter_per_ep = 100, lr = 1e-4, lr_scale = 5`):

[plot comparing the current and the proposed learning rate schedule]

It would be great to confirm which implementation produces the desired behavior; if it is the current version, I would propose adding a short inline comment to clarify. Thanks for looking into it!
To repro the plot and play around with other values, here's a small script:
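(A minimal sketch of what such a comparison could look like, assuming a standard cosine decay schedule; the `cosine_schedule` helper and the `min_lr` value below are illustrative assumptions, not the repository's actual implementation.)

```python
import numpy as np
import matplotlib.pyplot as plt

def cosine_schedule(base_value, final_value, epochs, niter_per_ep):
    """Plain cosine decay from base_value to final_value over all iterations
    (assumed helper for illustration only)."""
    iters = np.arange(epochs * niter_per_ep)
    return final_value + 0.5 * (base_value - final_value) * (
        1 + np.cos(np.pi * iters / len(iters))
    )

# Example parameters from the comment above; min_lr is an assumed value.
epochs, niter_per_ep = 42, 100
lr, min_lr = 1e-4, 1e-6
lr_scale = 5  # e.g. from a 5x larger effective batch size

# Current behavior: only the base lr is scaled.
current = cosine_schedule(lr * lr_scale, min_lr, epochs, niter_per_ep)
# Proposed behavior: min_lr is scaled by the same factor.
proposed = cosine_schedule(lr * lr_scale, min_lr * lr_scale, epochs, niter_per_ep)

plt.plot(current, label="current: min_lr unscaled")
plt.plot(proposed, label="proposed: min_lr scaled")
plt.xlabel("iteration")
plt.ylabel("learning rate")
plt.legend()
plt.show()
```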